Stream compression and decompression

ABSTRACT

A method for compressing a sequence of records, each record comprising a sequence of fields, comprises steps of buffering a record in a line of a matrix, reordering the lines of the matrix according to locality sensitive hash values of the buffered records such that records with similar contents in corresponding fields are placed in proximity, and consolidating fields in columns of the matrix into a block of codes. In this, consolidating yields codes of one of a first type comprising a sequence of individual fields and a second type comprising a sequence of fields with at least one repetition. The second type of code comprises a presence field indicating repeated fields and an iteration field indicating a number of respective repetitions. Decompression of the records from the block codes compressed above is also described.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based upon and claims priority from prior U.S. patent application Ser. No. 13/553,457, Attorney Docket No. CH920110018US1, filed on Jul. 19, 2012, which in turn is based upon and claims priority from prior European Patent Application No. 11175194.4, filed on Jul. 22, 2011, the entire disclosure of each of which is hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

The present invention relates to a system and a method for data stream compression and another system and another method for data stream decompression.

The invention also relates to a computer program product for carrying out one of the methods.

BACKGROUND OF THE INVENTION

In many technical applications, a stream of data is generated which must be recorded for later analysis. Such applications may be found in domains such as network forensics and cyber-security, scientific data analysis (e.g. archiving and analysis of astronomical measurements), and monitoring applications (e.g. in sensor networks). Often, the amount of data generated per unit of time is very large, so that extremely fast processing is mandatory to prevent the processing systems from clogging up with data.

In known archiving solutions, accessing the archived data may be slow because searching and decompressing the requested data is a computationally intensive task.

SUMMARY OF THE INVENTION

Embodiments of aspects of the invention provide methods with the features of claims 1 and 3, computer program products with the features of claim 10 and apparatuses with the features of claims 11 and 12. Dependent claims denote advantageous embodiments.

According to an embodiment of a first aspect of the invention, a method for compressing a sequence of records, each record comprising a sequence of fields, comprises steps of buffering a record in a line of a matrix, reordering the lines of the matrix according to locality sensitive hash values of the buffered records in such a way that records with similar contents in corresponding fields are placed in proximity, and consolidating fields from columns of the matrix into a block of codes. In this, consolidating yields codes of a first type comprising a sequence of individual fields or a second type comprising a sequence of fields with at least one repetition. The second type of code may comprise a presence field indicating repeated fields and an iteration field indicating a number of respective repetitions. Presence and iteration fields may be implemented as bitmaps.

The resulting codes may advantageously provide a high compression ratio, especially for data that comprises fixed fields such as header fields of TCP packets, time-tagged data or categorized measurements. Furthermore, the resulting codes may allow simplified access to the data without the requirement of decompressing all blocks containing data of interest.

Preferably, a sequence of fewer than three repeated fields is treated as two individual fields. It was found that this optimization saves the overhead of creating a code of the second type in many cases, resulting in better performance of the compression method.

According to an embodiment of a second aspect of the invention, a method for decompressing records from a block of codes as mentioned above comprises steps of, for each code that is to be extracted from the block, determining the type of the code, decompressing the sequence of individual fields if the code is of the first type, and decompressing the sequence of repeated fields if the code is of the second type.

The proposed decompression method may result in fast decompression of a block of codes. Steps of the methods may advantageously be carried out using highly efficient operations such as memcpy() on a system executing the decompression method.

Preferably, the method further comprises a preceding determination of which codes must be decompressed for obtaining a predetermined record.

Through this, it is possible to either perform a full decompression of the block or to decompress only those codes that contain information of the predetermined record. The effort for a full decompression may be traded off against the effort of determination and the possibly more complex operations required for a partial decompression. Overall performance of the decompression method may thus be further increased.

In one preferred embodiment, the fields are extracted into a matrix with lines representing records, the matrix being addressable in linearized form in such a way that each field of each record has an offset address in the linearized matrix. The determination of which codes must be decompressed for obtaining a predetermined record then comprises the steps of determining an offset address for each field of the predetermined record and, for each code in the block, determining a range of offset addresses the code decompresses to and determining that the code must be decompressed if the range comprises at least one of the determined offset addresses.

This approach may avoid the decompression of any codes that decompress to information that is not required. A memory requirement of the decompression method may be lowered and execution time may be reduced.

In a preferred embodiment, a grid of slots is logically superimposed on the matrix in order to reduce the number of comparisons. Specifically, determination in this embodiment comprises the steps of determining an offset address for each field of the predetermined record, linearly rescaling the determined offset address with a predetermined factor and setting a bit in a bitmap at the rescaled offset address to a predetermined value and, for each code in the block, determining a range of offset addresses the code decompresses to, linearly rescaling the range of offset addresses with the predetermined factor and determining that the code must be decompressed if the bitmap contains a bit with the predetermined value at a rescaled address that falls within the rescaled range of offset addresses.

By using the predetermined factor for rescaling all offset addresses, determination whether a code needs to be decompressed may require a reduced number of comparisons, thus increasing performance of the decompression method. Comparison of bits in the bitmap may exploit a bit-counting assembly instruction as provided by many execution systems. The predetermined factor is a tunable parameter that allows precision to be traded off against performance.

Preferably, the decompression method further comprises a preceding step of determining whether to decompress all codes of the block or to employ partial decompression as described above. This decision is based on at least one of a compression ratio and a selectivity, wherein the compression ratio indicates a quotient of the size of the block of codes by the size of the decompressed records and the selectivity indicates a quotient of the number of records to be decompressed by the number of records in the block.

According to an embodiment, both the compression ratio and the selectivity are parameters that may be accessible from the block of codes without actually decompressing any code of the block. The choice of decompression strategy may thus be made quickly and individually for each block. Empirical knowledge concerning the most advantageous decompression strategy for past blocks of codes may be employed for making the decision. Thus, characteristics of the compressed data inside of the records as well as of the machine executing the decompression method may be exploited. By occasional verification of the choice made between partial and full decompression, an adaptive decompression strategy may be implemented. Preferably, separate index data is provided for each block of codes, the index data comprising information that allows computation of the compression ratio and/or the selectivity. The index data may further comprise information on the kind of data that is included inside of the block, so that a search for specific data may be run against the index data first and, in case of a hit, records containing the actual data may be decompressed from the corresponding blocks of codes.

In a preferred embodiment of the invention, the determination whether to use full or partial decompression is carried out on the basis of a table indicating the method to use for different combinations of compression ratio and selectivity. Both compression ratio and selectivity may be graduated to form row and column addresses of the table, wherein elements of the table contain an indication of the preferred method to use. Thus, a very fast determination of the more advantageous decompression strategy may be implemented. This approach may require a modest memory footprint and offer the opportunity to amend individual entries in the table to facilitate the above-mentioned adaptive choice.

In one embodiment, elements of the table hold an indication of an average effort for carrying out full and/or partial decompression. The effort may e.g. be expressed in units of processing time or complexity. Thus, information in the table may be updated in order to adapt the choice to the data.

According to an embodiment of a third aspect of the invention, a computer program product implementing one of the above-indicated methods is run on a processing unit or stored on a computer-readable medium.

According to an embodiment of a fourth aspect of the invention, a compression apparatus for compressing a sequence of records, each record comprising a sequence of fields, comprises a storage in matrix organization for accepting and buffering a record in a line of the matrix, reordering means for determining locality sensitive hash values of the buffered records and for reordering the lines of the matrix according to the determined hash values in such a way that records with similar contents in corresponding fields are placed in proximity, and consolidating means for consolidating fields in columns of the matrix into a block of codes and for outputting the block of codes. The consolidating means are adapted to yield codes of a first type comprising a sequence of individual fields or a second type comprising a sequence of fields with at least one repetition. The second type of code may comprise a presence field indicating repeated fields and an iteration field indicating the number of respective repetitions. The presence field may be a bitmap and the iteration field may be an array.

The compression apparatus may advantageously comprise a programmable computer or microcomputer executing the above-mentioned compression method.

According to an embodiment of a fifth aspect of the invention, a decompression apparatus for decompressing records from a block of codes, the block having a structure like the block that can be output by the above-mentioned apparatus or the compression method above, comprises determination means for determining the type of each of the codes in the block, first decompression means for decompressing the sequence of individual fields if the code is of the first type and second decompression means for decompressing the sequence of repeated fields if the code is of the second type. The two means for decompression may be identical in some embodiments.

The decompression apparatus may preferably comprise a programmable computer or microcomputer executing the above-mentioned decompression method.

Both the compression and the decompression apparatus may easily be integrated into a communication or measurement system where large volumes of data need to be archived and processed at a later time. Efficiency of data analysis may thus be improved, which may lead to cost advantages and improved control over the data. The decompression method and the decompression apparatus may be used to particular advantage in data analysis that comprises access to a varying number of records per block of codes. Such analysis is often required in “needle in a haystack” type searches, which are common in network forensics.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be explained with reference to the enclosed drawings, in which:

FIG. 1 shows a flowchart of a data compression method;

FIG. 2 shows a flowchart of a data decompression method;

FIG. 3 shows a block diagram of a system for compression according to the method of FIG. 1;

FIG. 4 shows a block diagram of a system for decompression according to the method of FIG. 2;

FIG. 5 shows an illustration of partial and full decompression according to the decompression method of FIG. 2;

FIG. 6 shows an illustration of determining offset addresses in a linearized matrix according to the decompression method of FIG. 2;

FIG. 7 shows an illustration of a bitmap based determination of blocks to decompress according to the decompression method of FIG. 2; and

FIG. 8 shows an illustration of table based decision making whether to use full or partial decompression according to the method of FIG. 2.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

It should be understood that these embodiments are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed inventions. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in the plural and vice versa with no loss of generality.

The flowcharts and block diagrams in the figures illustrate the system, methods, as well as architecture, functions and operations executable by a computer program product according to the embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code, which contains one or more executable instructions for performing specified logic functions. It should be noted that, in some alternative implementations, the functions noted in the blocks may also occur in a sequence different from what is noted in the drawings. For example, two blocks shown consecutively may be performed substantially in parallel or in an inverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts and a combination of blocks in the block diagrams and/or flowcharts may be implemented by a dedicated hardware-based system for performing specified functions or operations or by a combination of dedicated hardware and computer instructions.

Hereinafter, the principle and spirit of the present invention will be described with reference to various exemplary embodiments. It should be understood that provision of these embodiments is only to enable those skilled in the art to better understand and further implement the present invention, not intended for limiting the scope of the present invention in any manner.

Hereinafter, various embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 shows a flowchart of a data compression method 100. Method 100 starts in a step 105, in which one record of data is read. The record contains at least one field that is preferably of fixed size and position in the record. Furthermore, it is preferred that the field stands at or near the beginning of the record. One example for such a record is a TCP packet in which several fixed width header fields for source and destination addresses, ports, protocol and other information are provided.

In a step 110, the record is stored in a line of a table. The table has sufficient width to accommodate the whole record and will preferably comprise a predetermined number of lines for accommodating a corresponding number of records.

In a step 115, it is determined whether the table is full after the record has been stored in the table in step 110. In other embodiments of the method, it can also be tested in step 115 whether the table is filled above a certain degree, e.g. 95%. If the table is found to have space left, method 100 loops back to step 105. Otherwise, it proceeds with an optional step 120 in which an index for the table is created. The index will later be stored along with the compressed data of the table so that the index can be searched for specific information before the corresponding data is extracted, i.e. decompressed for further analysis. The index will preferably be searchable for contents of the one or more header fields and indicate if the block of codes comprises a record with fields containing such contents. The index may also comprise statistical and/or summary information on the records and/or the codes. Such information may comprise a compression ratio, a compressed size and a decompressed size of the records.
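The buffering loop of steps 105 to 115 might be sketched as follows. This is only an illustration: the function name `fill_tables`, the parameters `capacity` and `fill_threshold`, the representation of a record as a tuple of integers, and the flushing of a partially filled last table are all assumptions of the sketch and are not taken from the described method.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, ...]  # assumed record type: a tuple of fixed-width fields


def fill_tables(records: Iterable[Record], capacity: int,
                fill_threshold: float = 1.0) -> Iterator[List[Record]]:
    """Buffer records line by line and yield each table once it is full enough."""
    # fill_threshold = 0.95 would model the "filled above 95%" variant of step 115.
    limit = max(1, int(capacity * fill_threshold))
    table: List[Record] = []
    for record in records:       # step 105: read one record
        table.append(record)     # step 110: store it in a line of the table
        if len(table) >= limit:  # step 115: is the table full?
            yield table
            table = []
    if table:                    # assumption: flush a partially filled last table
        yield table
```

Each yielded table would then be passed on to the indexing, hashing and consolidation steps described below.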

In a step 125, a locality sensitive hash value is determined for each line of the table. As is known in the art, a hash value for an object represents an expression, often of fixed size, that is characteristic for every property of the object. Should the object be changed at all, the corresponding hash will, in general, also change. A locality sensitive hash value is a special kind of hash value that has the property that it changes with the object in an analogous way. That is, two objects that are largely identical and only differ in a few places will have locality sensitive hash values that differ only slightly. As a measure for the difference between two objects or hash values, the Hamming distance may be employed.
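The property described above can be illustrated with a deliberately simple hash. The sketch below is a toy construction and an assumption of this example, not the hash function of the described method: `toy_lsh` keeps the low bits of every field, so changing one field can change only that field's share of the hash bits. Fields are assumed to be non-negative integers.

```python
from typing import Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def toy_lsh(record: Tuple[int, ...], bits_per_field: int = 4) -> int:
    """Toy locality sensitive hash: concatenate the low bits of each field."""
    mask = (1 << bits_per_field) - 1
    value = 0
    for field in record:
        value = (value << bits_per_field) | (field & mask)
    return value


similar_a = (10, 20, 30, 40)
similar_b = (10, 20, 30, 41)   # differs from similar_a in one field only
unrelated = (99, 7, 3, 200)

# Similar records yield hash values with a small Hamming distance.
print(hamming(toy_lsh(similar_a), toy_lsh(similar_b)))
print(hamming(toy_lsh(similar_a), toy_lsh(unrelated)))
```

A practical implementation would likely use an established locality sensitive hashing scheme; the toy version only demonstrates that a small change in a record leads to a small change in its hash value.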

In a step 130, the table is reordered according to the generated locality sensitive hash values, thus placing records with similar content in corresponding fields in relative proximity inside the table. In a following step 135, the matrix may be transposed. The transposition of step 135 does not have to be physically carried out on the table. Alternatively, it may be sufficient to adapt the addressing of the matrix so that a columnar access instead of a line-wise access is possible. On a computer system, the table may be kept in a memory in linearized form. In this case, accessing the matrix in columnar form may be done through a well-known address transposition mechanism.
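Steps 130 and 135 might be sketched as follows. Sorting the lines by hash value is one plausible reading of "reordering according to the hash values" and is an assumption of this sketch, as are the function names and the row-major storage of the buffered table. The `column` function shows how a column can be read through address arithmetic alone, without physically transposing the data.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[int, ...]


def reorder(table: List[Record], lsh: Callable[[Record], int]) -> List[Record]:
    """Step 130 (assumed as a sort): order lines by locality sensitive hash value."""
    return sorted(table, key=lsh)


def column(flat: Sequence[int], n_rows: int, n_cols: int, c: int) -> List[int]:
    """Step 135 without moving data: read column c of a row-major linearized table."""
    return [flat[r * n_cols + c] for r in range(n_rows)]
```

With a table of three two-field records stored line by line, `column(flat, 3, 2, 1)` reads the second field of every record by stepping through memory at a stride of two elements.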

In a step 140, the matrix is scanned in columnar fashion and recurring fields are counted. In a preferred embodiment of the invention, a sequence of two identical consecutive fields is treated as two individual fields, while three or more identical consecutive fields are treated as a recurrence.
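The counting rule of step 140 can be sketched in a few lines. The function name `collect_runs`, the `min_run` parameter and the `(value, count)` output format are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def collect_runs(col: Sequence[int], min_run: int = 3) -> List[Tuple[int, int]]:
    """Scan one column and return (value, count) entries.

    Runs of at least min_run identical consecutive fields become one entry
    with count > 1; shorter runs are emitted as individual fields (count 1).
    """
    entries: List[Tuple[int, int]] = []
    for value, group in groupby(col):
        length = len(list(group))
        if length >= min_run:
            entries.append((value, length))
        else:
            entries.extend((value, 1) for _ in range(length))
    return entries
```

For the column `[5, 5, 5, 5, 7, 7, 9]`, the four fives form a recurrence, whereas the two sevens stay individual fields because a run of two is below the threshold.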

In a following step 145, the collected field is filed into a code. There are two types of codes available, a first type comprising a sequence of individual fields and a second type that additionally comprises a presence field and an iteration field. The presence field, which may be stored as a bitmap preceding the sequence of fields, indicates which of the fields represent a plurality of identical consecutive fields. The iteration field, which may be stored as a vector appended to the sequence of fields, denotes the number of iterations for each field indicated by the presence field. The first type code may thus be considered a special case of the second type code in which no recurring fields are stored. Therefore, in step 145, the collected field is filed into a type two code and the corresponding elements in the presence and iteration fields are set if the collected field represents a plurality of fields.

In a step 150, it is determined whether the generated code has reached its capacity. If there is room left inside of the code, method 100 loops back to step 140. Otherwise, in a step 155 it is determined whether the generated code contains any repeated fields. If this is the case, the second-type code is finalized in a step 160 in which an indication of the type of code is introduced. This indication may be stored as leading information in the generated code.

If it is determined in step 155 that there are no recurring fields inside of the generated code, the presence and iteration fields are removed from the second-type code, thus making it a first-type code. In a step 170, the first-type code is finalized by introducing an indication for the type of code.
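The construction and finalization of a code in steps 145 to 170 might look as follows. The dictionary layout, the key names and the function name `build_code` are assumptions chosen for readability; an actual implementation would presumably use a packed binary layout with the type indication as leading information.

```python
from typing import Dict, List, Tuple


def build_code(entries: List[Tuple[int, int]]) -> Dict[str, object]:
    """Build a code from (value, count) entries and finalize its type."""
    fields = [value for value, _ in entries]
    # Presence field: one flag per field, set if the field stands for a repetition.
    presence = [1 if count > 1 else 0 for _, count in entries]
    # Iteration field: one repetition count per field flagged in the presence field.
    iterations = [count for _, count in entries if count > 1]
    if not iterations:
        # Step 155 found no recurring fields: drop presence and iteration fields.
        return {"type": 1, "fields": fields}
    return {"type": 2, "presence": presence, "fields": fields,
            "iterations": iterations}
```

A code built from `[(5, 4), (7, 1), (7, 1), (9, 1)]` is of the second type with one iteration entry, while a code built only from entries with count 1 collapses to the first type.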

After one of steps 160 and 170, the finalized code of type one or type two is output and method 100 may terminate or loop back to step 105 to compress more records into a code. A predetermined number of generated codes may form a block of codes that may be supplied with the information generated in step 120.

FIG. 2 shows a flowchart of a data decompression method 200. Method 200 is adapted to work on a block of codes as obtainable from compression method 100.

Decompression method 200 begins in a step 205, in which a code is accepted. The code is of type one or type two as described above with respect to compression method 100.

In a following step 210, one or several requests for records are read. This step may be omitted if the complete contents of the code are to be decompressed. The request may have been generated on the basis of a search against the index described above with respect to step 120 of compression method 100. The search may return an indication of which fields inside the code contain the sought information.

In a step 215, it is determined whether partial decompression or full decompression of the code will be performed. In the case of full decompression, the decompression is carried out in a step 220 in which each of the fields inside the code is expanded into a column of a matrix, filling in the appropriate number of recurrences for iterated fields of a type two code. The matrix may then be logically transposed as described above with respect to step 135 of compression method 100 so that all records which were compressed in the code may be accessed as lines of the matrix. If required, the lines of the matrix may be reordered, e.g. according to contents of the lines of the matrix, i.e. contents of the fields of the records.
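Full decompression of a single code in step 220 can be sketched as below. The dictionary layout of a code (keys `type`, `fields`, `presence`, `iterations`) and the function name `expand_code` are assumptions of this sketch and not a format defined by the description.

```python
from typing import Dict, List


def expand_code(code: Dict[str, object]) -> List[int]:
    """Expand one code into the column entries it represents."""
    fields = list(code["fields"])
    if code["type"] == 1:
        # First type: a plain sequence of individual fields.
        return fields
    counts = iter(code["iterations"])
    column: List[int] = []
    for value, repeated in zip(fields, code["presence"]):
        # A flagged field consumes the next repetition count; others occur once.
        column.extend([value] * (next(counts) if repeated else 1))
    return column
```

Expanding a second-type code with fields `[5, 7, 7, 9]`, presence flags `[1, 0, 0, 0]` and iterations `[4]` restores the column `[5, 5, 5, 5, 7, 7, 9]`.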

If partial decompression is to be used, method 200 continues with a step 225 in which it is determined which parts of the requested records, i.e. lines of the matrix, are comprised by which fields of a code. If the matrix is stored in linearized form, e.g. in a computer memory, the required matrix elements will have addresses that can be expressed as offsets from a first address of the whole matrix. It may be advantageous to determine matches between the fields of a code and fields of a record on the basis of offset addresses of the matrix.

In an optional step 230, a bitmap is generated containing a number of bits that is preferably lower than the number of elements inside of the matrix. Each offset address in the matrix may thus be scaled to a bit address inside the bitmap by multiplying the offset address with a factor that is the quotient of the number of bits in the bitmap by the number of elements in the matrix. Bits of the bitmap that correspond to matrix offset addresses with information on the requested records will be set, while other bits will be reset (or vice versa). Thus, the bitmap represents an indication of codes to decompress, wherein the indication is more coarse-grained than the offset addresses generated in step 225 and thus requires fewer comparison operations for identification of the corresponding fields to be decompressed from the code, as carried out below.

In a step 235, a field is retrieved from the code. In a step 240, it is determined which offset addresses of the matrix the code expands to. The determined range of addresses may be scaled in a step 245 with the factor of step 230 above.

In a step 250, it is determined whether the scaled range comprises a set bit in the bitmap of step 230. If this is the case, the field is decompressed in a step 255. Otherwise, in a step 260, it is determined if the present field is the last field in the code. If this is not the case, decompression method 200 loops back to step 235 to proceed with the next field from the code.

Should the last field have been reached in step 260 or the full decompression been performed in step 220, the requested records are returned in a step 265.

FIG. 3 shows a block diagram of a system 300 for compression according to method 100 of FIG. 1.

System 300 comprises a storage 305, which holds a matrix 310, and a processing unit 315. The storage 305 may be filled through a first interface 320 and the processing unit 315 may output codes through a second interface 325. In other embodiments of the invention, the first interface 320 may also be connected to the processing unit 315.

System 300 is adapted to accept a stream of records through the first interface 320 and to output one or several blocks of codes with compressed records through the second interface 325. The system 300 is particularly adapted to carry out compression method 100 of FIG. 1. The codes provided through the second interface 325 will have the structure of the first and second type codes mentioned above with respect to FIG. 1. The system 300 may also be adapted to generate and associate an index with one or several generated blocks of codes. The index may contain information as mentioned above with respect to step 120 of compression method 100 of FIG. 1.

FIG. 4 shows a block diagram of a system 400 for decompression according to the decompression method 200 of FIG. 2. Decompression system 400 is particularly suited for executing decompression method 200 of FIG. 2. Like compression system 300 of FIG. 3, the decompression system 400 of FIG. 4 may advantageously be implemented through a programmable computer or microcomputer. System 400 comprises a storage 405 containing a matrix 410, a processing unit 415, a first interface 420 and a second interface 425, all of which elements correspond to elements in the compressing system 300 of FIG. 3. Furthermore, there is provided a first decompressor 430 for decompressing codes of the first type and a second decompressor 435 for decompressing codes of the second type, both codes as mentioned above with respect to FIG. 1.

In another embodiment of the invention, one or both of the decompressors 430, 435 may be comprised by processing unit 415.

FIG. 5 shows an illustration of partial and full decompression according to the decompression method 200 of FIG. 2.

In the center of FIG. 5, a matrix 505 is shown, corresponding to the table of step 110, the matrix referenced above with respect to FIG. 2, and matrices 310, 410 of systems 300 and 400, respectively. Table 505 is organized in columns 510 and rows (lines) 515. The table 505 is adapted to hold all the decompressed information of one block 518 of codes 520. Each code 520 is in the format of a first-type code or a second-type code, as described above with respect to step 145 in compression method 100 in FIG. 1.

For a full decompression of the block 518 of codes 520, all codes 520 are expanded one by one into fields 525 and filed into columns 510 of matrix 505. It can be seen that e.g. the codes 520 labeled “1” and “2” expand into fields 525 that represent much longer portions of a column 510 of the table 505 than e.g. the fields 525 corresponding to codes 520 labeled “8” and “9”. This is due to the fact that in codes 520 labeled “1” and “2”, a larger number of repeated fields exist so that the compression ratio of these codes 520 is higher.

After the thirteen codes 520 of the block of codes 518 have been expanded into the table 505, all columns 510 are filled with fields of decompressed records. Each decompressed record 530 is found in one row 515 of table 505. In order to access the individual records 530, a simple line-wise access to table 505 can be used.

Partial decompression of block 518 will now be explained with reference to an example in which only the records of rows 515 labeled R1 and R2 are requested. In the table 505, fields 525 that need to be decompressed in order to contribute to a portion of a row R1 or R2 are shaded. The shaded fields 525 comprise those with labels 2, 3, 6 and 11.

The number of rows 515 an expanded field 525 occupies after decompression may be determined by multiplying the size of the field with the number of iterations associated with the field 525. Thus, the position of a successive field inside of a column 510 in matrix 505 can be determined without actually decompressing any previous field 525. Therefore, the fields 525 that are not shaded inside of table 505 need not be decompressed in order to provide the correct contents of rows R1 and R2.
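The position calculation described above can be sketched without expanding any data. The sketch counts positions in rows and assumes one matrix element per field occurrence; the dictionary layout of a code and the names `entry_ranges` and `needed_entries` are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def entry_ranges(code: Dict[str, object], start: int = 0) -> List[Tuple[int, int]]:
    """Return the (first_row, last_row) range each field of a code expands to."""
    fields = list(code["fields"])
    presence = code.get("presence", [0] * len(fields))  # type-one codes have none
    counts = iter(code.get("iterations", []))
    ranges: List[Tuple[int, int]] = []
    row = start
    for repeated in presence:
        length = next(counts) if repeated else 1  # only the counts are read
        ranges.append((row, row + length - 1))
        row += length
    return ranges


def needed_entries(code: Dict[str, object], rows: Iterable[int],
                   start: int = 0) -> List[int]:
    """Indices of the fields that contribute to at least one requested row."""
    wanted = set(rows)
    return [i for i, (lo, hi) in enumerate(entry_ranges(code, start))
            if any(lo <= r <= hi for r in wanted)]
```

For a code with fields `[5, 7, 7, 9]`, presence flags `[1, 0, 0, 0]` and iterations `[4]`, a request for rows 2 and 5 touches only the first and third field; the remaining fields can be skipped.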

Exemplary records 530 which are stored inside rows 515 labeled R1 and R2 are shown in the right-hand side of FIG. 5. Each record 530 comprises several fields 535.

FIG. 6 shows an illustration 600 for determining offset addresses in a linearized matrix according to the decompression method 200 of FIG. 2.

In the upper portion of FIG. 6, matrix 505 with columns 510 and rows 515 of FIG. 5 is shown in another embodiment. The number n of columns 510 is 4 and the number m of rows (lines) 515 is 3. Elements of matrix 505 are shown with exemplary content.

In the lower portion of FIG. 6, matrix 505 is shown in a linearized form. The linearized form is used e.g. if matrix 505 is stored in a computer memory with linear addresses. In this case, the top-most element of matrix 505, which corresponds to the first column 510 and the first row 515, is assigned a base address of matrix 505. Other elements can be accessed through an offset address that denotes a number of addresses from the base address. Consecutive addresses correspond to successive elements of the same column 510 and increasing rows 515. After the elements of the first column 510, the elements of the next column 510 are stored in a similar manner. The rest of the columns 510 of table 505 follow in the same way.

In order to access the linearized matrix 505 in a row-based fashion, starting e.g. in the first column 510 and the third row 515, the first element is found at offset address three and consecutive elements of the same row 515 are accessed at intervals of m (=3) elements from there. Thus, transposing matrix 505 in its linearized form may be implemented by simply changing the address schemes of row-based addressing and column-based addressing.
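The address arithmetic for the column-wise linearized matrix can be written down directly. The sketch below assumes zero-based row, column and offset numbering, which is a convention of this example; the figure may count positions differently. The function names are likewise assumptions.

```python
from typing import List


def offset(row: int, col: int, m: int) -> int:
    """Offset of element (row, col) when columns of m rows are stored one after another."""
    return col * m + row


def row_offsets(row: int, m: int, n: int) -> List[int]:
    """Offsets of all n elements of one row; they lie m elements apart."""
    return [offset(row, c, m) for c in range(n)]


# With m = 3 rows and n = 4 columns, the elements of the third row
# (index 2 under zero-based counting) are spaced at intervals of m.
print(row_offsets(2, m=3, n=4))
```

Reading a row therefore requires no physical transposition: it amounts to stepping through the linearized matrix with a stride of m.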

FIG. 7 shows an illustration 700 of bitmap based determination of fields to decompress according to the decompression method 200 of FIG. 2.

In the shown example, a series of five fields 525 of FIG. 5 are expanded into table 505, which is displayed in its linearized form in the center of FIG. 7. The linearized matrix 505 is subdivided into s segments. Borders between the s segments do not necessarily coincide with borders between elements of the linearized matrix 505 since the number of segments s is smaller than the number of elements inside of matrix 505.

Each of the s segments is assigned one bit in a bitmap 705. The corresponding bit inside bitmap 705 is set if and only if an element of matrix 505 that lies within the boundaries of the corresponding segment needs to be decompressed.

In this way, the scheme for determining which codes 520 need to be decompressed for a given row or a set of rows, as explained above with respect to FIG. 5, may be implemented with a lower number of comparisons between offset addresses of a requested row and offset addresses of an expanded field 525 in the linearized matrix 505.

Once the bitmap 705 has been populated with the correct bits, the codes 520 may be inspected one by one to determine the range of offset addresses they will expand to inside of matrix 505. Upper and lower limits of these ranges are translated to indices of bitmap 705 by multiplying the limits with the quotient of the number of segments s by the number of elements inside of matrix 505. Should any bit inside of bitmap 705 between the determined bitmap addresses be set, the corresponding code 520 needs to be decompressed.
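A minimal sketch of the bitmap test follows. It assumes that an offset address is mapped to a bit index by integer scaling with the number of segments s divided by the number of matrix elements, so that every index stays within the bitmap. The function names and the use of a list of booleans instead of a packed bitmap are assumptions of this sketch.

```python
from typing import Iterable, List


def build_bitmap(requested: Iterable[int], n_elements: int, s: int) -> List[bool]:
    """Set the bit of every segment that contains a requested offset address."""
    bitmap = [False] * s
    for off in requested:
        bitmap[off * s // n_elements] = True
    return bitmap


def must_decompress(lo: int, hi: int, bitmap: List[bool], n_elements: int) -> bool:
    """True if any segment overlapping the offset range [lo, hi] has its bit set."""
    s = len(bitmap)
    return any(bitmap[lo * s // n_elements: hi * s // n_elements + 1])


# 12 matrix elements, 4 segments: each bit covers three consecutive offsets.
bitmap = build_bitmap([2, 5], n_elements=12, s=4)
print(bitmap)
print(must_decompress(6, 8, bitmap, 12))   # range lies in an unset segment
print(must_decompress(3, 4, bitmap, 12))   # shares a segment with offset 5
```

The last call illustrates the loss of precision: the range 3 to 4 contains no requested offset, yet it is reported because it shares a segment with offset 5. A packed integer bitmap would additionally allow the range test to be done with mask and bit-count operations.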

The number of segments s is a tunable parameter. The larger the number of segments s is with respect to the number of elements inside the matrix 505, the higher the precision of the determination will be, i.e. the lower the probability is that any code 520 will be decompressed that contains no information that is part of a requested row. On the other hand, the lower the number of segments is, the fewer comparisons need to be made in order to figure out whether or not a code 520 needs to be decompressed.

FIG. 8 shows an illustration 800 of a table based decision making process whether to use full or partial decompression of a code according to the decompression method 200 of FIG. 2. The decision making discussed below is specifically adapted to be carried out inside of step 215 of method 200.

A table or matrix 805 is provided in ten columns 810 and ten rows (lines) 815. Other embodiments may employ different numbers of columns 810 and/or rows 815. Columns 810 are assigned to a selectivity parameter given as a percentage, and rows 815 are assigned to a compression ratio parameter also given as a percentage.

As described above, the selectivity describes how many records are requested for decompression out of a code 520 with respect to the number of records that are compressed inside the code 520. The compression ratio describes the ratio of the size of a code 520 to the size of the decompressed matrix 505. Both the selectivity and the compression ratio parameter are obtainable for a code 520 without decompressing any of the fields 525.

In each element of table 805, there is stored an indication of whether it is more advantageous to carry out full decompression of a block 518 of codes 520 or a partial decompression for the given combination of selectivity and compression ratio.
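By way of a non-limiting illustration, a lookup in such a table may be sketched as follows. The sketch assumes ten equally wide buckets per parameter, rows indexed by compression ratio and columns indexed by selectivity. The names, the string markers and the fallback to full decompression for cells without an indication are hypothetical choices made for the example.

```python
FULL, PARTIAL, UNKNOWN = "full", "partial", "unknown"


def selectivity(requested_records, records_in_block):
    """Fraction of the compressed records that is requested."""
    return requested_records / records_in_block


def compression_ratio(compressed_size, decompressed_size):
    """Size of the codes divided by the size of the decompressed data."""
    return compressed_size / decompressed_size


def bucket(fraction, buckets=10):
    """Map a fraction to a bucket index, clamping values of 1.0 or more."""
    return min(int(fraction * buckets), buckets - 1)


def choose_strategy(table, sel, ratio, default=FULL):
    """Look up the stored indication; fall back to `default` if none exists."""
    entry = table[bucket(ratio)][bucket(sel)]
    return default if entry == UNKNOWN else entry


# Assumed example table with two populated cells.
table = [[UNKNOWN] * 10 for _ in range(10)]
table[2][9] = FULL       # ratio 20-30 %, selectivity 90-100 %
table[2][0] = PARTIAL    # ratio 20-30 %, selectivity 0-10 %

strategy = choose_strategy(table, selectivity(90, 100), compression_ratio(250, 1000))
```

Both parameters are computed from sizes and record counts only, so the decision is made before any field is decompressed.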

For instance, if selectivity is high and compression ratios are moderate or low, full decompression may be advisable, as can be seen from the white shading of elements of table 805 towards the upper right-hand corner. On the other hand, partial decompression may be advisable when selectivity is low and the compression ratio is moderate, as can be seen from the dark shading of table 805 towards the left. In the given example of FIG. 8, no indication is available when the compression ratio is high, as is indicated by the chequer pattern towards the lower end of table 805.

To populate table 805, representative codes 520 for a kind of data of interest may be decompressed both partially and fully in order to find out which strategy consumes less computing effort. The computing effort may for instance be given as a number of clock cycles of a processing unit. Table 805 may also be populated during decompression according to method 200 of FIG. 2 in that every time a combination of selectivity and compression ratio is found for which there is no information available inside of table 805, both full and partial decompression are carried out and an indication of the less effort consuming method may be written into the corresponding cell of table 805. Computing efforts in table 805 may also be verified occasionally and updated if necessary.
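By way of a non-limiting illustration, populating the table during decompression may be sketched as follows. The sketch assumes that the two strategies are supplied as callables that both return the requested records and that an empty cell is marked by None. Elapsed wall-clock time is used as a stand-in for the number of clock cycles; all names are hypothetical.

```python
import time


def timed(fn):
    """Run fn and return its result together with the elapsed time."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def decompress_and_learn(table, row, col, full_fn, partial_fn, run=timed):
    """Use the stored indication if present; otherwise try both and record the cheaper one."""
    entry = table[row][col]
    if entry == "full":
        return full_fn()
    if entry == "partial":
        return partial_fn()
    # No information for this cell yet: carry out both strategies once.
    _, full_cost = run(full_fn)
    records, partial_cost = run(partial_fn)
    table[row][col] = "full" if full_cost <= partial_cost else "partial"
    return records
```

The `run` parameter allows a different effort measure, such as a hardware cycle counter, to be substituted without changing the decision logic. Occasional verification of a populated cell could be implemented by resetting it to None.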

It should be noted that the above depiction is only exemplary and is not intended to limit the present invention. In other embodiments of the present invention, this method may have more, fewer, or different steps. The steps are numbered only to make the depiction more concise and clearer, not to stringently limit the sequence of the steps, and the sequence of steps may differ from the depiction.

Thus, in some embodiments, the above one or more optional steps may be omitted. Specific implementation of each step may be different from the depiction. All these variations fall within the spirit and scope of the present invention.

The present invention may adopt a form of hardware embodiment, software embodiment or an embodiment comprising hardware components and software components. In a preferred embodiment, the present invention is implemented as software, including, without limitation, firmware, resident software, micro-code, etc.

Moreover, the present invention may be implemented as a computer program product usable from computers or accessible by computer-readable media that provide program code for use by or in connection with a computer or any instruction executing system. For the purpose of description, a computer-usable or computer-readable medium may be any tangible means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device.

The medium may be an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system (apparatus or device), or propagation medium. Examples of the computer-readable medium would include the following: a semiconductor or solid storage device, a magnetic tape, a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), a hard disk, and an optical disk. Examples of the current optical disk include a compact disk read-only memory (CD-ROM), compact disk-read/write (CD-R/W), and DVD.

A data processing system adapted for storing or executing program code would include at least one processor that is coupled to a memory element directly or via a system bus. The memory element may include a local memory usable during actually executing the program code, a mass memory, and a cache that provides temporary storage for at least one portion of program code so as to decrease the number of times for retrieving code from the mass memory during execution.

An Input/Output or I/O device (including, without limitation, a keyboard, a display, a pointing device, etc.) may be coupled to the system directly or via an intermediate I/O controller.

A network adapter may also be coupled to the system such that the data processing system can be coupled to other data processing systems, remote printers, or storage devices via an intermediate private or public network. A modem, a cable modem, and an Ethernet card are merely examples of a currently usable network adapter.

It is to be understood from the foregoing description that modifications and alterations may be made to the respective embodiments of the present invention without departing from the true spirit of the present invention. The description in the present specification is intended to be illustrative and not limiting. The scope of the present invention is limited by the appended claims only.

What is claimed is:
 1. An apparatus for compressing a sequence of records, each record comprising a sequence of fields, the method comprising, comprising: a memory; a processor communicatively coupled to the memory for performing; buffering a record in a line of a matrix; reordering the lines of the matrix according to locality sensitive hash values of the buffered records such that records with similar contents in corresponding fields are placed in proximity; consolidating fields in columns of the matrix into a block of codes; and consolidating yields codes of one of a first type comprising a sequence of individual fields and a second type comprising a sequence of fields with at least one repetition; wherein the second type of code comprises a presence field indicating repeated fields and an iteration field indicating the number of respective repetitions.
 2. The apparatus according to claim 1, wherein a sequence of less than three repeated fields is treated as two individual fields.
 3. The apparatus according to claim 1, further comprising decompressing records from a block of codes, for each code that is to be extracted from the block: determining the type of the code; decompressing the sequence of individual fields if the code is of the first type; and decompressing the sequence of repeated fields if the code is of the second type.
 4. The apparatus according to claim 3, further comprising a preceding determination which codes must be decompressed for obtaining a predetermined record.
 5. The apparatus according to claim 4, wherein the fields are extracted into a matrix with lines representing records, the matrix is addressable in linearized form such that each field of each record has an offset address in the linearized matrix, and determination comprises the steps of: determining an offset address for each field of the predetermined record; for each code in the block: determining a range of offset addresses the code decompresses to; and determining that the code must be decompressed if the range comprises at least one offset address in the set.
 6. The apparatus according to claim 4, wherein the fields are extracted into a matrix with lines representing records, the matrix is addressable in linearized form such that each field of each record has an offset address in the linearized matrix, and determination comprises the steps of: determining an offset address for each field of the predetermined record; linearly rescaling the determined offset addresses with a predetermined factor and setting a bit in a bitmap at the rescaled offset address to a predetermined value; for each code in the block: determining a range of offset addresses the code decompresses to; linearly rescaling the range of offset addresses with the predetermined factor; and determining that the code must be decompressed if the bitmap contains a bit with the predetermined value at a rescaled address that falls within the rescaled range of offset addresses.
 7. The apparatus according to claim 4, further comprising a preceding determination whether to decompress all codes of the block or to employ partial decompression according, wherein the determination is carried out on the basis of at least one of: a compression ratio indicating a quotient of a size of the block of codes by a size of the decompressed records; and a selectivity indicating a quotient of the number of records to be decompressed by the number of records in the block.
 8. The apparatus according to claim 7, wherein determination whether to use full or partial decompression is done on the basis of a table indicating the method to use for different combinations of compression ratio and selectivity.
 9. The apparatus according to claim 8, wherein elements of the table hold an indication of an average effort for carrying out full and/or partial decompression.
 10. A nontransitory machine readable medium encoded with a program for compressing a sequence of records, each record comprising a sequence of fields, the program comprising instructions for: buffering a record in a line of a matrix; reordering the lines of the matrix according to locality sensitive hash values of the buffered records such that records with similar contents in corresponding fields are placed in proximity; consolidating fields in columns of the matrix into a block of codes; and consolidating yields codes of one of a first type comprising a sequence of individual fields and a second type comprising a sequence of fields with at least one repetition; wherein the second type of code comprises a presence field indicating repeated fields and an iteration field indicating the number of respective repetitions.
 11. The nontransitory machine readable medium according to claim 10, wherein a sequence of less than three repeated fields is treated as two individual fields.